STALKER: Learning Wrappers for Semistructured, Web-based Information Sources
Authors
Abstract
Information mediators are systems capable of providing a unified view of several information sources. Central to any mediator that accesses Web-based sources is a set of wrappers that can extract relevant information from Web pages. In this paper, we present a wrapper-induction algorithm that generates extraction rules for Web-based information sources. We introduce landmark automata, a formalism that describes classes of extraction rules. Our wrapper-induction algorithm, STALKER, generates extraction rules that are expressed as simple landmark grammars, a class of landmark automata that is more expressive than the existing extraction languages. Based on just a few training examples, STALKER learns extraction rules for documents with multiple levels of embedding. The experimental results show that our approach successfully wraps classes of documents that cannot be wrapped by existing techniques.
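The abstract does not spell out the rule syntax, so the following is only a minimal sketch of what applying a landmark-based extraction rule might look like. It assumes a rule is an ordered list of literal string landmarks that are skipped to in turn, followed by a single end landmark. The function names `skip_to` and `extract`, the sample page, and the use of plain substring matching are all illustrative assumptions, not the paper's formalism, and the sketch does not cover the learning step that STALKER performs.

```python
from typing import Optional, Sequence


def skip_to(text: str, landmarks: Sequence[str], start: int = 0) -> Optional[int]:
    """Consume each landmark in order; return the index just past the last one.

    Hypothetical helper: landmarks are literal substrings here.
    """
    pos = start
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return None  # the rule does not match this document
        pos = found + len(landmark)
    return pos


def extract(text: str, start_rule: Sequence[str], end_landmark: str) -> Optional[str]:
    """Return the text between the end of start_rule and the next end_landmark."""
    begin = skip_to(text, start_rule)
    if begin is None:
        return None
    end = text.find(end_landmark, begin)
    if end == -1:
        return None
    return text[begin:end]


# Made-up sample page, for illustration only.
page = "<p>Name: <b>Yala</b><p>Cuisine: Thai<p>Phone: <i>(800) 111-1717</i>"
phone = extract(page, ["Phone", "<i>"], "</i>")
name = extract(page, ["Name", "<b>"], "</b>")
```

A rule that names a landmark missing from the page, such as `extract(page, ["Fax"], "<")`, returns `None` rather than a wrong span, which is the behaviour a wrapper would typically want when a field is absent.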
Similar Resources
Active Learning for Hierarchical Wrapper Induction
Information mediators that allow users to integrate data from several Web sources rely on wrappers that extract the relevant data from the Web documents. Wrappers turn collections of Web pages into database-like tables by applying a set of extraction rules to each individual document. Even though the extraction rules can be written by humans, this is undesirable because the process is tedious, ...
AutoWrapper: automatic wrapper generation for multiple online services
A crucial challenge for information extraction from the WWW is to generate wrappers, which are information extraction patterns or rules that apply to numerous Web sites with great diversity in both format and content. Generating wrappers manually is tedious, time-consuming and error-prone. Recent research has successfully adapted machine learning technology to generate wrappers for semi-struct...
Automatically Generated DAML Markup for Semistructured Documents
The semantic web is becoming a realizable technology due to the efforts of researchers to develop semantic markup languages such as the DARPA Agent Markup Language (DAML). A major problem that faces the semantic web community is that most information sources on the web today lack semantic markup. To fully realize the potential of the semantic web, we must find a way to automatically upgrade inf...
Finite-State Transducers for Semi-Structured Data Extraction from the Web
Integrating a large number of Web information sources may significantly increase the utility of the World Wide Web. A promising solution to the integration is through the use of a Web information mediator that provides seamless, transparent access for the clients. Information mediators need wrappers to access a Web source as a structured database, but building wrappers by hand is impractical. P...
Query Processing in Heterogeneous Information Sources
The thesis presents a system that provides integrated access to heterogeneous information sources that may contain unstructured or semistructured data that are not described by a regular schema (e.g., the World Wide Web). The sources may have different and limited query capabilities, and complete knowledge of their contents and structure may not exist. First an abstraction is proposed for the rep...